TD - Chapter details v2

Chapter details

The Bulgarian language processing chain consists of the following tools: BG sentence splitter, BG tokenizer, BG Part of Speech (PoS) tagger, BG lemmatizer, BG Noun phrase (NP) extractor, BG Named entity (NE) recognizer, BG Word sense (WS) disambiguator and BG Stop word (SW) recognizer. All tools are self-contained and designed to work in a chain, i.e. the output of each component is the input of the next one. The first four tools run in a strict order: sentence splitter, tokenizer, PoS tagger, lemmatizer. The remaining tools take the lemmatizer output as input, and there are no dependencies in their execution order. Each tool associates tokens from the input text with its own set of annotation tags. The tools exchange data along the chain in the so-called vertical format.

The common form of the vertical format is:

tok1 Tag1 Tag2 Tag3 ..... Tagk
tok2 Tag1 Tag2 Tag3 ..... Tagk
...................... 
tokn Tag1 Tag2 Tag3 ..... Tagk

In the vertical format, tokens are separated by a newline and annotation tags by a tab character (\t). A single tool can assign tags with a complex internal structure (marked with delimiters), place different types of annotation in separate columns, and annotate a group of tokens. Each tool writes its tags at fixed positions in one or several columns.
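The layout described above can be read and written with a few lines of Python. This is only a minimal sketch: the names Row, parse_vertical and to_vertical are illustrative and not part of the tool chain, and the sketch assumes that every non-empty line holds exactly one token followed by its tab-separated tags.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    """One line of the vertical format: a token and its tag columns."""
    token: str
    tags: List[str]


def parse_vertical(text: str) -> List[Row]:
    """Split vertical-format text into rows; blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        token, *tags = line.split("\t")
        rows.append(Row(token, tags))
    return rows


def to_vertical(rows: List[Row]) -> str:
    """Serialize rows back to tab-separated lines, one token per line."""
    return "\n".join("\t".join([row.token, *row.tags]) for row in rows)


sample = "Писмо\tTOK_FUCA\t0,5\nдо\tTOK_LCA\t6,2\n"
rows = parse_vertical(sample)
```

Because each tool appends its tags in fixed columns, a downstream component only needs to know the column index of the annotation it consumes.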

BG sentence splitter

The sentence splitter marks the sentence boundaries in raw Bulgarian text. It applies regular rules and lexicons, both manually crafted by an expert. For example, the general rule for sentence splitting is: ([\.\?\!]\n?\s*)(\-?\s?[А-Я1-9A-Z]). The lexicons, which list abbreviations that must or may be followed by a capital letter, a number, etc. in the middle of a sentence, are applied before the regular rules. The lexicons are compiled by a separate tool, the Lexicon compiler, into minimal acyclic finite state automata, which allows efficient processing.
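As an illustration of how the general rule and an abbreviation lexicon interact, here is a rough Python sketch. It is not the splitter's implementation: the regular expression is the rule quoted above, while the ABBREVIATIONS set is a hypothetical two-entry stand-in for the compiled lexicons, and the function name split_sentences is invented for this example.

```python
import re
from typing import List

# The general splitting rule quoted in the text: a terminator with optional
# whitespace, followed by the (possibly dash-prefixed) start of a new sentence.
BOUNDARY = re.compile(r"([.?!]\n?\s*)(-?\s?[А-Я1-9A-Z])")

# Hypothetical stand-in for the compiled abbreviation lexicons.
ABBREVIATIONS = {"гр.", "г."}


def split_sentences(text: str) -> List[str]:
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        cut = match.end(1)  # the boundary falls after terminator and whitespace
        words = text[start:match.start(1) + 1].split()
        last_word = words[-1] if words else ""
        if last_word in ABBREVIATIONS:
            continue  # the lexicon check takes precedence over the general rule
        sentences.append(text[start:cut].strip())
        start = cut
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


plain = split_sentences("Писмо до Иван Йоцов от Враца. То е последното Ботево писмо.")
abbreviated = split_sentences("Той живее в гр. Враца. Тя също.")
```

In the second call the period after "гр" matches the general rule, but the lexicon check suppresses the split, which is why the lexicons have to be consulted first.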

BG tokenizer

The Bulgarian tokenizer demarcates strings of letters, numbers, punctuation marks, special symbols, combinations of them, and empty symbols. Regular patterns are used to recognize some simple cases of named entities, such as dates, fractions, emails, internet addresses and abbreviations. The tokenizer also classifies each recognized token (for example: lowercase Cyrillic letters, capital Latin letters, etc.). It uses finite state transducers for token recognition and type matching. The token demarcation and token classification rules are defined and compiled as finite state transducers with a separate tool, ParseEst.

For example, if the text ‘Писмо до Иван Йоцов от Враца. То е последното Ботево писмо.’ (A letter written to Ivan Yotsov from Vratsa. This is the last letter written by Botev.) is passed through the tokenizer, the output in the vertical format will be:

 Писмо      TOK_FUCA      0,5
 до      TOK_LCA      6,2
 Иван      TOK_FUCA      9,4
 Йоцов      TOK_FUCA      14,5
 от      TOK_LCA      20,2
 Враца      TOK_FUCA      23,5
 .      TOK_FS      28,1
 То      TOK_FUCA      30,2
 е      TOK_LCA      33,1
 последното      TOK_LCA      35,10
 Ботево      TOK_FUCA      46,6
 писмо      TOK_LCA      53,5
 .      TOK_FS      58,1

Example of the output format of the BG tokenizer

Here the first column contains the graphical representation of tokens, the second column consists of the associated token tags and the third column represents the position and length of tokens.
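The three columns can be reproduced for this example with a simplified tokenizer. The sketch below uses a plain regular expression rather than finite state transducers. The meaning of the tag names is inferred from the example only (TOK_FUCA for a word with an initial capital, TOK_LCA for a lowercase word, TOK_FS for a full stop), and TOK_OTHER is a hypothetical fallback that does not come from the tool.

```python
import re
from typing import List, Tuple

# A word (run of letters/digits) or a single non-space symbol.
TOKEN = re.compile(r"\w+|[^\w\s]")


def classify(token: str) -> str:
    """Assign a token tag; tag semantics are guessed from the example."""
    if token == ".":
        return "TOK_FS"
    if token.isalpha() and token[0].isupper() and token[1:] == token[1:].lower():
        return "TOK_FUCA"
    if token.isalpha() and token.islower():
        return "TOK_LCA"
    return "TOK_OTHER"  # hypothetical catch-all, not a real tag of the tool


def tokenize(text: str) -> List[Tuple[str, str, int, int]]:
    """Return (token, tag, start offset, length) for every token."""
    return [
        (m.group(), classify(m.group()), m.start(), len(m.group()))
        for m in TOKEN.finditer(text)
    ]


text = "Писмо до Иван Йоцов от Враца. То е последното Ботево писмо."
tokens = tokenize(text)
vertical = "\n".join(f"{tok}\t{tag}\t{start},{length}" for tok, tag, start, length in tokens)
```

Offsets are counted in characters from the start of the text, so the token text can always be recovered as text[start:start + length].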

BG PoS tagger

The Bulgarian PoS tagger marks up each word with the most probable Part of Speech and unambiguous morpho-syntactic information, chosen from the set of tags associated with the word. The tagger is based on SVM (Support Vector Machines) learning. It predicts the PoS tag of a word from a set of features describing the word and its context. These features are words, word bigrams and trigrams within a window of words around the currently tagged word; PoS tags, PoS tag bigrams and trigrams in the current window; and, for unknown words, information about suffixes, prefixes, capitalization, hyphenation, etc. The tagger is trained and tested on a manually PoS-disambiguated corpus. The strategy chosen for training the Bulgarian tagger is: two passes in both directions; a window of five tokens, with the currently tagged word in the second position; bigrams and trigrams of words, tags or ambiguity classes; and lexical parameters such as prefixes, suffixes, sentence borders and capital letters. The trained model is then applied to disambiguate texts. The current precision of the tagger is 96.58%. The tagger uses SVMTool, an open source utility for training tagger models and applying them to PoS disambiguation. To improve the robustness of SVMTool, an alternative disambiguation module has been developed in C++. The new implementation provides integration with the lower levels of annotation and full Unicode support, and improves the model loading speed.
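To make the feature description more concrete, the following sketch extracts word-based features from a five-token window with the current word in the second position. The feature names, the padding symbol and the three-character affix length are assumptions chosen for illustration; they do not reproduce the SVMTool feature format, and the tag-based features are omitted.

```python
from typing import Dict, List, Union

PAD = "<s>"  # hypothetical padding symbol for positions outside the sentence


def window(tokens: List[str], i: int, size: int = 5, position: int = 1) -> List[str]:
    """Window of `size` tokens in which tokens[i] sits at index `position`."""
    start = i - position
    return [tokens[j] if 0 <= j < len(tokens) else PAD
            for j in range(start, start + size)]


def features(tokens: List[str], i: int) -> Dict[str, Union[str, bool]]:
    win = window(tokens, i)
    feats: Dict[str, Union[str, bool]] = {}
    for k, word in enumerate(win):            # single words, relative offsets -1..3
        feats[f"w[{k - 1}]"] = word
    for k in range(len(win) - 1):             # word bigrams
        feats[f"w[{k - 1}..{k}]"] = " ".join(win[k:k + 2])
    for k in range(len(win) - 2):             # word trigrams
        feats[f"w[{k - 1}..{k + 1}]"] = " ".join(win[k:k + 3])
    word = tokens[i]
    feats["prefix3"] = word[:3]               # lexical cues, mainly useful for unknown words
    feats["suffix3"] = word[-3:]
    feats["capitalized"] = word[:1].isupper()
    feats["hyphenated"] = "-" in word
    return feats


sentence = ["Писмо", "до", "Иван", "Йоцов", "от", "Враца", "."]
first = features(sentence, 0)
```

Each feature would then be mapped to a dimension of the sparse vector passed to the SVM classifier.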

The BG PoS tagger is executed in two modes:

‘Command line’ mode in which the tagger is run with a command line argument containing the name of the input file. The generated output is returned to the standard output.
‘Server’ mode in which the tagger listens on a TCP socket for client connections. The input and output data are provided by the TCP socket. Concurrent client connections are allowed.

BG lemmatizer

The Bulgarian lemmatizer determines the lemma and the detailed morpho-syntactic annotation of a given word form. Lemmatization is based on an unambiguous association between the tagger output and the information encoded in a large grammatical dictionary of Bulgarian. Tagging uses a reduced tagset (75 word classes compared to 1029 unique grammatical tags in the dictionary), compiled so that it retains the minimum information necessary for an unambiguous association with the respective lemma. A small number of rules and preferences are also implemented to limit ambiguity in lemmatization. The grammatical dictionary is represented as a finite state automaton, which provides very efficient lookup. The dictionary is part of the executable file.
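The association between the tagger output and the dictionary can be pictured as a lookup keyed by the word form and its reduced tag. The sketch below is a toy model under stated assumptions: a Python dict stands in for the finite state automaton, the three entries are illustrative, and all tag names are invented placeholders rather than tags from the actual tagset or dictionary.

```python
from typing import Dict, Optional, Tuple

# (word form, reduced tag) -> (lemma, detailed tag).
# Every tag below is an invented placeholder, not a tag used by the real tools.
DICTIONARY: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("вие", "PRON"): ("вие", "PRON:pers:2pl"),
    ("вие", "VERB"): ("вия", "VERB:pres:3sg"),
    ("писмо", "NOUN"): ("писмо", "NOUN:neut:sg"),
}


def lemmatize(form: str, reduced_tag: str) -> Optional[Tuple[str, str]]:
    """Return (lemma, detailed tag), or None when the pair is not in the dictionary."""
    return DICTIONARY.get((form.lower(), reduced_tag))


as_pronoun = lemmatize("вие", "PRON")
as_verb = lemmatize("вие", "VERB")
```

The form "вие" is ambiguous on its own, but the reduced tag supplied by the tagger is enough to select a single lemma, which is the property the reduced tagset is designed to preserve.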

BG NP extractor

The Bulgarian NP extractor recognizes and annotates noun phrases and their heads in the input text. The extractor is a rule-based parser that uses a manually crafted grammar designed according to the following criteria: to recognize unambiguous phrases, and to exclude pronouns as modifiers as well as relative clauses. The rules are defined in the ParseEst XML-based formalism as context-free or context-dependent rules with no limit on the number of constituents. They are based on the part of speech tags and the values of grammatical categories of word forms, and they provide annotation for phrase boundaries and heads. As a result, a large number of noun phrases and their heads are unambiguously annotated; the number varies across different types of texts. The generic tool used for Bulgarian (as well as for English) NP extraction is ParseEst, a system for compiling and processing linguistic rules. It consists of two modules: lr_builder and lr_engine. The ParseEst lr_builder compiles linguistic rules into finite state transducers: the input is an XML file with rule definitions, and the output is a compiled finite state transducer together with meta symbol definitions. A finite state transducer is constructed for each rule. For each rule group there is a corresponding transducer, constructed by composing all single-rule transducers. The resulting transducers are then composed according to their priority into a single final transducer. The transducers are applied by the ParseEst lr_engine over a lemmatized text, and they add syntactic information in the form of annotations, such as brackets and labels for noun phrase heads.
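The kind of annotation the extractor produces (brackets around a phrase and a label on its head) can be illustrated with a single toy rule. This sketch does not use the ParseEst formalism or finite state transducers: the rule "any number of adjectives followed by a noun", the single-letter tags and the /HEAD label are all assumptions made for the example.

```python
import re
from typing import List, Tuple

# Hypothetical single-letter PoS codes: A = adjective, N = noun.
# Toy rule: zero or more adjectives followed by a noun; the noun is the head.
NP_RULE = re.compile(r"A*N")


def annotate_np(tagged: List[Tuple[str, str]]) -> str:
    """Bracket noun phrases and mark their heads in a tagged sentence."""
    # One character per token, so regex offsets map directly to token indexes.
    codes = "".join(tag if tag in ("A", "N") else "x" for _, tag in tagged)
    out: List[str] = []
    pos = 0
    for match in NP_RULE.finditer(codes):
        out.extend(word for word, _ in tagged[pos:match.start()])
        words = [word for word, _ in tagged[match.start():match.end()]]
        words[-1] += "/HEAD"
        out.append("[NP " + " ".join(words) + "]")
        pos = match.end()
    out.extend(word for word, _ in tagged[pos:])
    return " ".join(out)


tagged_sentence = [("То", "P"), ("е", "V"), ("последното", "A"),
                   ("Ботево", "A"), ("писмо", "N"), (".", "x")]
annotated = annotate_np(tagged_sentence)
```

A real grammar would contain many such rules, grouped and ordered by priority, and would also test the grammatical categories of the word forms rather than only their part of speech.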

BG NE recognizer

The tool recognizes and marks different types of named entities (NE) in the input text. Named entity recognition is performed by ParseEst, the generic tool for compiling and processing linguistic rules. The rules are defined in the ParseEst XML-based formalism as context-free rules. They are based on different kinds of input information (words, lemmas, part of speech tags, values of grammatical categories of lemmas or word forms, and lexicons), and they provide annotation for named entity boundaries and types. The ParseEst lr_builder compiles linguistic rules into finite state transducers: the input is an XML file with rule definitions, and the output is a compiled finite state transducer together with meta symbol definitions. The transducers are applied by the ParseEst lr_engine over a lemmatized text. Both lr_builder and lr_engine operate with the compiled lexicons produced by a separate tool, the Lexicon compiler. The recognized named entities cover dates, money, percentage and time expressions, as well as names of organizations, locations and persons.